Distributed High-Dimensional Index Creation using Hadoop, HDFS and C++
نویسندگان
چکیده
This paper describes an initial study where the opensource Hadoop parallel and distributed run-time environment is used to speed-up the construction phase of a large high-dimensional index. This paper first discusses the typical practical problems developers may run into when porting their code to Hadoop. It then presents early experimental results showing that the performance gains are substantial when indexing large data sets.
منابع مشابه
High Scalability of HDFS using Distributed Namespace
In data intensive computing, Hadoop is widely used by organizations. The client applications of Hadoop require high availability and scalability of the system. Mostly, these applications are online and their data growth rate is unpredictable. The present Hadoop relies on secondary namenode for failover which slows down the performance of the system. Hadoop system’s scalability depends on the ve...
متن کاملLive Website Traffic Analysis Integrated with Improved Performance for Small Files using Hadoop
Hadoop, an open source java framework deals with big data. It has HDFS (Hadoop distributed file system) and MapReduce. HDFS is designed to handle large amount files through clusters and suffers performance penalty while dealing with large number of small files. These large numbers of small files pose a heavy burden on the NameNode of HDFS and an increase execution time for MapReduce. Secondly, ...
متن کاملNameNode and DataNode Coupling for a Power-Proportional Hadoop Distributed File System
Current works on power-proportional distributed file systems have not considered the cost of updating data sets that were modified (updated or appended) in a low-power mode, where a subset of nodes were powered off. Effectively reflecting the updated data is vital in making a distributed file system, such as the Hadoop Distributed File System (HDFS), power proportional. This paper presents a no...
متن کاملOptimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is the storage layer for Apache Hadoop ecosystem, persisting large data sets across multiple machines. However, the overall storage capacity is limited since the metadata is stored in-memory on a single server, called the NameNode. The heap size of the NameNode restricts the number of data files and addressable blocks persisted in the file system. The H...
متن کاملCeph as a scalable alternative to the Hadoop Distributed File System
[email protected] THE HADOOP D I S TR I BUTED F I L E System (HDFS) has a single metadata server that sets a hard limit on its maximum size. Ceph, a high-performance distributed file system under development since 2005 and now supported in Linux, bypasses the scaling limits of HDFS. We describe Ceph and its elements and provide instructions for installing a demonstration system that can be used...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012